Algorithm-based Fault Tolerance for Floating-point Operations in Massively Parallel Systems

Author

  • Jennifer Rexford
Abstract

This paper considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This paper proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. It is shown that the partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.
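
As a rough illustration of the checksum idea behind ABFT, the Python sketch below encodes a matrix with one column-sum checksum row per block of rows, multiplies it, and then checks each block of the product against its checksum row. It is not the paper's algorithm, which the abstract does not spell out; the function names (`matmul`, `encode_partitioned`, `failed_blocks`), the block size, and the relative tolerance are all assumptions chosen for the example.

```python
def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def encode_partitioned(a, block):
    """Append one column-sum checksum row after every `block` rows of `a`."""
    encoded = []
    for start in range(0, len(a), block):
        rows = a[start:start + block]
        encoded.extend(rows)
        encoded.append([sum(col) for col in zip(*rows)])
    return encoded


def failed_blocks(c_encoded, block, n_rows, tol=1e-9):
    """Return indices of row blocks whose checksum row disagrees with the data rows.

    `n_rows` is the number of data rows before encoding. The comparison uses a
    relative tolerance (an assumed value) because floating-point rounding means
    the recomputed sums rarely match the checksum row bit for bit.
    """
    bad = []
    pos, index, remaining = 0, 0, n_rows
    while remaining > 0:
        size = min(block, remaining)
        rows = c_encoded[pos:pos + size]
        checksum = c_encoded[pos + size]
        for col, expected in zip(zip(*rows), checksum):
            col_sum = sum(col)
            scale = max(1.0, abs(col_sum), abs(expected))
            if abs(col_sum - expected) > tol * scale:
                bad.append(index)
                break
        pos += size + 1
        remaining -= size
        index += 1
    return bad


# Usage: a 4x2 matrix split into two blocks of two rows each.
a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[0.5, 1.0], [1.5, 2.0]]
c = matmul(encode_partitioned(a, 2), b)
print(failed_blocks(c, 2, 4))   # no fault injected: []
c[3][0] += 1.0                  # corrupt one element in the second block
print(failed_blocks(c, 2, 4))   # the second block is flagged: [1]
```

Because each checksum row in this sketch covers only `block` rows, the sums being compared stay smaller than a single whole-matrix checksum would be, which is one plausible reading of why a partitioned encoding could behave better numerically; the paper itself should be consulted for the actual argument.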


Similar resources

An approach to fault detection and correction in design of systems using of Turbo codes

We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such code. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...

A new fixed degree regular network for parallel processing

We propose a family of regular Cayley network graphs of degree three based on permutation groups for design of massively parallel systems. These graphs are shown to be based on the shuffle exchange operations, to have logarithmic diameter in the number of vertices, and to be maximally fault tolerant. We investigate different algebraic properties of these networks (including fault tolerance) and...

Deadlock-Free: Fault Tolerant Wormhole Routing in Mesh based Massively Parallel Systems

In this paper we present a routing scheme which is extremely suited for use in massively parallel systems. The routing algorithm is fault-tolerant so that network failures will not stop the system. For reasons of scalability, the routing information is extremely compact, also when the network is injured. The wormhole routing technique guarantees very low routing latency. Moreover the routing...

A Software Implemented Fault-tolerance Layer for Reliable Computing on Massively Parallel Computers and Distributed Computing Systems

A novel architecture for a software-implemented fault-tolerance layer for application reliability on massively parallel computers and distributed computing systems is proposed. This is the first attempt at providing a purely software-based, user-level solution for fault detection, reconfiguration, and recovery in a parallel environment. The symmetrically distributed, multi-tiered layer envelopes u...

Reversible Logic Multipliers: Novel Low-cost Parity-Preserving Designs

Reversible logic is one of the new paradigms for power optimization that can be used instead of the current circuits. Moreover, the fault-tolerance capability in the form of error detection or error correction is a vital aspect for current processing systems. In this paper, as the multiplication is an important operation in computing systems, some novel reversible multiplier designs are propose...

Publication date: 2007